---
id: tutorial-metrics-selection
title: Selecting Your Metrics
sidebar_label: Selecting Your Metrics
---

Once you have a clearly defined evaluation criteria, selecting metrics becomes significantly easier. In some cases, you may find **existing metrics** in DeepEval that already match your criteria. In others, you'll need to create **custom metrics** to address your unique evaluation needs.

:::tip
DeepEval provides [30+ metrics](/docs/metrics-introduction) to help you evaluate your LLM. **Familiarizing yourself with these metrics** can help you choose the ones that best align with your evaluation criteria.
:::

## Selecting Metrics Relevant To Your Criteria

In this section, we’ll be selecting the **LLM evaluation metrics** for our medical chatbot based on the evaluation criteria we've established in the previous section. Let’s quickly revisit these criteria:

1. **Directly addressing the user:** The chatbot should directly address users' requests
2. **Providing accurate diagnoses:** Diagnoses must be reliable and based on the provided symptoms
3. **Providing professional responses:** Responses should be clear and respectful

### Answer Relevancy

Let's start with our first metric, which will evaluate our medical chatbot against our first criterion:

```
Criteria 1: The medical chatbot should address the user directly.
```

Currently, our chatbot sometimes fails to directly address user queries, instead taking the lead in the conversation—for example, asking for appointment details instead of focusing on diagnosing the patient. This results in responses that only tangentially address the user's input. To address this, we should be evaluating **how relevant the chatbot's responses are to the user query**.

To address this, you can leverage `deepeval`'s default [`AnswerRelevancyMetric`](/docs/metrics-answer-relevancy), which is available out-of-the-box and evaluates how relevant an LLM's output is to the input.

:::info
The `AnswerRelevancyMetric` uses an LLM to extract all statements from the `actual_output` and then classifies each statement's relevance to the `input` using the same LLM. You can read more on how each individual default metric is calculated by visiting their [individual metric pages.](/docs/metrics-answer-relevancy#how-is-it-calculated)
:::

### Faithfulness

Our next metric addresses the inaccuracies in patient diagnoses. The chatbot's failure to deliver accurate diagnoses in some example interactions suggests that our **RAG tool needs improvement**.

```
Criteria 2: The chatbot should provide accurate diagnoses based on the given symptoms.
```

This is because the RAG engine is responsible for **retrieving relevant medical information from our knowledge base** to support patient diagnoses. To address this, we need to evaluate specifically whether the information in the retrieved chunks actually align with the information in the actual output.

`deepeval`'s [`FaithfulnessMetric`](/docs/metrics-faithfulness) is well-suited for this task. It assesses the whether the `actual_output` factually aligns with the contents of the `retrieval_context`.

:::tip  
`deepeval` offers a total of **5 RAG metrics** to evaluate your RAG pipeline. To learn more about selecting the right metrics for your use case, check out this [in-depth guide on RAG evaluation](/guides/guides-rag-evaluation).
:::

### Professionalism

Our final metric will address Criterion 3, focusing on evaluating our chatbot's **professionalism**.

```
Criterion 3: The chatbot should provide clear, respectful, and professional responses.
```

Since `deepeval` doesn't natively support this evaluation criteria, we'll need to define our own custom `Professionalism` metric using `deepeval`'s custom metric [`G-Eval`](/docs/metrics-llm-evals), and that's OK. Defining custom metrics is nothing to be afraid of, and while `deepeval` offers a tons of default metrics that are ready to use out-of-the-box there are often times more use case specific definitions of a metric that requires more customization.

The professionalism metric here is a great example - what it means to be professional in one work setting can be drastically different from another and in our case, the custom professionalism metric we define will allow us to ensure that the chatbot maintains a professional tone typically expected in a medical setting.

:::note
G-Eval is a **custom metric framework** that enables users to leverage LLMs for evaluating outputs based on their own tailored evaluation criteria.
:::

Now that we've selected our three metrics, let's see how to implement them in code.

## Defining Metrics in DeepEval

To define our **Answer Relevancy**, **Contextual Relevancy**, and custom **G-Eval** metric for professionalism, you'll first need to install DeepEval. Run the following command in your CLI:

```bash
pip install deepeval
```

### Defining Default Metrics

Let's begin by defining the Answer Relevancy and Contextual Relevancy metrics, which is as simple as importing and instantiating their respective classes.

```python
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualRelevancyMetric
)

answer_relevancy_metric = AnswerRelevancyMetric()
contextual_relevancy_metric = ContextualRelevancyMetric()
```

### Defining a Custom Metric

Next, we'll define our custom G-Eval metric for professionalism. This involves specifying the name of the metric, the evaluation criteria, and the parameters to evaluate. In this case, we're only assessing the LLM's `actual_output`.

```python
from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import GEval

# Define criteria for evaluating professionalism
criteria = """Determine whether the actual output demonstrates professionalism by being
clear, respectful, and maintaining an empathetic tone consistent with medical interactions."""

# Create a GEval metric for professionalism
professionalism_metric = GEval(
    name="Professionalism",
    criteria=criteria,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT]
)
```

:::info
**G-Eval is a two-step algorithm** that first uses chain-of-thought reasoning (CoTs) to generate a series of evaluation steps based on the specified `criteria`. It then applies these steps to assess the parameters provided in an `LLMTestCase` and calculate the final score.
:::

With the evaluation criteria defined and metrics selected, we can finally begin running evaluations in the following section.
